Unicode character support in screen tab names#1642
Conversation
|
Fails CI: |
Noted. It should have been |
IDK, but nowhere else do we check for |
|
Any objections to pull in 4ccfc38 to |
I have just one question with the commit: Is it okay for the constants to be defined in BTW, I borrowed the "margin left" and "column gap" naming from CSS. 🙂 |
Fine. |
|
5f64daf to
fe32490
Compare
| typedef struct WCharEncoderState_ { | ||
| size_t pos; | ||
| void* buf; | ||
| mbstate_t mbState; | ||
| } WCharEncoderState; |
There was a problem hiding this comment.
This should also carry the buffer size, so the internal EncodeWChar routines can perform bound checking.
Also note, that the CGroup code deliberately uses two distinct functions (one for counting the size, another for actually writing) and that these functions return success (character properly counted/written to buffer) or failure (OOB write). Similar could be done here too.
There was a problem hiding this comment.
I choose to make the size counting and the actual writing the same routine for a reason: I expect the "rules" for filtering characters be tweakable in the future. If there are additional characters besides !iswprint that need to be filtered out (replaced with U+FFFD), people only need to modify one routine for that.
I'm no sure about bound checking yet. I can add it into the struct, but it would be for assertions only, as I always expect the buffer allocated would just fit. (That is, no truncation.)
There was a problem hiding this comment.
Update: I've changed my mind about the bound checking here. Considering the standard library functions mbsrtowcs(3) and wcsrtombs(3) both allow limits to the destination buffer size, I can add it to the WCharEncoderState as a feature.
I still don't recommend it though. The string buffer would not be null-terminated when the truncation occurs, and I won't fix that (for multi-byte encodings other than UTF-8, it is allowed that wcrtomb(buf, L'\0', ps) returns a length greater than 1).
There was a problem hiding this comment.
That's what "Failure to write to the buffer" could be used for. If the EncodeWChar function sees that the buffer is full and not NUL-terminated, it should return false (AKA failure), which all callees could propagate accordingly. Similar to what the CGROUP compression code¹ does …
¹Actually that code over-allocates by one byte and forces the NUL-termination in that over-allocated byte. Cf. linux/CGroupUtils.c in functions CGroup_filterName and CGroup_filterContainer.
There was a problem hiding this comment.
@BenBE Nope. After a bit of consideration, this length check cannot be done in EncodeWChar like in the CGROUP code. I wrote a comment on why instead.
It's a limitation of the wcrtomb(3) interface. When we can detect wcrtomb writing out of bounds we might be too late (access violation already happens). wcrtomb doesn't give us the "length check before writing" we need. Trying to workaround that would make code unnecessarily complicated (wcrtomb called one more time - three times in total).
Also, we cannot put the '\0' byte in arbitrary position. Unlike in the CGROUP code, we can only put '\0' on character boundaries here. So overall there's nothing I can do.
There was a problem hiding this comment.
Yes. And with the implementation used in the CGroup code, a false returned from the buffer write callback would immediately terminate processing. That's basically the same semantics used there too.
There was a problem hiding this comment.
@BenBE That means an additional memcpy call, slowing things down slightly. For checking an error that should never happen unless there's a bug in code. I don't really like this.
There was a problem hiding this comment.
That memcpy is from L1 cache to a buffer that's likely in L1 too. As mentioned elsewhere: If in doubt, I prefer correctness over performance.
And if you want to avoid this, over-allocate by MB_LEN_MAX + 1 and you can always write to ps->buf, if you force the NUL terminator on error.
There was a problem hiding this comment.
And if you want to avoid this, over-allocate by
MB_LEN_MAX + 1and you can always write tops->buf, if you force the NUL terminator on error.
Over-allocation to solve the (actually non-existent) problem of under-allocation. Sorry, but I'm irritated by this logic.
The whole point of adding the bound check by your suggestion was to correct the case of under-allocation (allocated buffer size is less than required). While I argued first that the problem literally won't exist, I agree on the point that we can assert and detect the error, at least as a precaution. The buffer should be allocated with the right size already, thus any code that attempts to correct or recover from any potential error is useless. I might just make the error fatal instead and call it a day.
if (ps->buf && len <= ps->size - ps->pos) {
fail();
}As for why there is no NUL termination for the string when odds happen here: We are using a multibyte encoder wcrtomb that is capable of handling any encoding, not just UTF-8. When interpreting the standard behavior of wcrtomb in the strict sense, it is not necessary for a null wide character to be encoded as 1 byte. The standard wcrtomb will output shift sequences (if any) that restore the shift state to the initial state before outputting the null byte itself. Since that would require extra buffer space (up to MB_CUR_MAX bytes), any truncation of a multibyte string would have to be decided beforehand, and it won't be an escape hatch for any buffer size problem.
There was a problem hiding this comment.
ACK on making the error fatal.
Anyway, a good API makes it impossible to cause issues by abusing it. Thus if wcrtomb may write up to MB_LEN_MAX bytes you better should pass a buffer that can hold MB_LEN_MAX bytes, i.e. ensure the remaining space is large enough. Even if the actual call is known to write less.
And adding MB_LEN_MAX + 1 to the allocation won't cause any noticeable overhead (malloc will just allocate the same and in rare occasions or under memory pressure may delay due to the OOM killer sweeping some swap …
72b854a to
2ca232d
Compare
ef4f749 to
f4a21e7
Compare
b2db29d to
b61a964
Compare
41331a0 to
f7a1a80
Compare
d881071 to
d1713e5
Compare
| for (unsigned int i = 0; i < settings->nScreens; i++) { | ||
| const char* tab = settings->screens[i]->heading; | ||
| int width = rem >= bracketWidth ? (int)strnlen(tab, rem - bracketWidth + 1) : 0; | ||
| const char* ptr = tab; |
There was a problem hiding this comment.
Why introduce this extra variable? Was this patch written by AI? ;-)
There was a problem hiding this comment.
Technically no AI is used :)
The ptr is a temporary variable holding a pointer because of the API design I explained in https://github.com/htop-dev/htop/pull/1642/changes/BASE..6b82bcef3b8f7aecd9be6c83d66e84af152a1d58#r3211053429
| return buf; | ||
| } | ||
|
|
||
| bool String_decodeNextWChar(MBStringDecoderState* ps) { |
There was a problem hiding this comment.
Naming consistency String_encodeWChar vs. String_decodeNextWChar.
Also, any reason to have this function exported from this TU?
There was a problem hiding this comment.
There's a reason for asymmetric naming here. (You may suggest a better naming.)
encodeWChar() and decodeNextWChar() are for low-level encoder and decoder with states.
For the decoder:
bool String_decodeNextWChar(MBStringDecoderState* ps);The input string is specified as ps->str and the next character decoded will be outputted in ps->ch.
For the encoder:
/* The typedef is to allow different encoders to coexist. I have thought of an encoder that would present all non-printable characters as "\xXXXX" escape sequence. */
typedef ATTR_NONNULL void (*EncodeWChar)(WCharEncoderState* ps, /*wchar_t or int*/ wc);
void String_encodeWChar(WCharEncoderState* ps, /*wchar_t or int*/ wc);The next character that would be encoded is specified as wc argument here and will be appended to the end of ps->buf. The WCharEncoderState object does not otherwise know the next wchar_t to encode.
Perhaps the name String_encodeWChar is obscure on what it does. Maybe the name WCharEncoder_appendWChar will be better? I'm not sure yet.
| return totalWidth; | ||
| } | ||
|
|
||
| int String_mbswidth(const char** str, size_t maxLen, int maxWidth) { |
There was a problem hiding this comment.
Why have the public API of this function modify an input argument to move to the end of the function?
There was a problem hiding this comment.
This function has two outputs: One is the terminal width (i.e. num of terminal columns) and the other is the end character position after calculating the width.
The second output is useful to tell, for example, how many Unicode characters would it take to fit the number of terminal columns.
This functions has two uses combined in a single API: (1) to count the number of terminal columns for a given multibyte string; (2) to count the number characters it takes to fit the given terminal width.
There was a problem hiding this comment.
It might seem redundant, but I'd prefer if this output parameter was a separate (NULL-able) argument to the function. This separates concerns (input argument vs. optional additional output) and makes the API clearer. This also allows to just skip the extra local variable in case you don't care AND don't have it unexpectedly be changed.
There was a problem hiding this comment.
It might seem redundant, but I'd prefer if this output parameter was a separate (NULL-able) argument to the function. This separates concerns (input argument vs. optional additional output) and makes the API clearer. This also allows to just skip the extra local variable in case you don't care AND don't have it unexpectedly be changed.
A static inline function that wraps around this String_mbswidth() for a more convenient output interface? That wouldn't hurt, I guess.
By the way, the "input pointer that would be modified on output" is not my original idea. There are precedents in mbsrtowcs(3) and mbsnrtowcs(3) (compare mbsrtowcs(3) with mbstowcs(3) and I would argue that the const char** argument in the former makes the API more flexible).
There was a problem hiding this comment.
I gave both variants of the API to Copilot to compare (without initially disclosing what the function would do) and except for "if it runs in a loop for parsing" the additional out argument was the preferred variant. This only solidified further once I gave Copilot also some more semantics of the function it just had rated. One interesting aspect was that modern compiler tend to optimize the additional out parameter variant better. So there's several aspects that lean towards that variant of the API.
Using the C standard library's string function API as a baseline for API quality sets a very low bar for quality … ;-)
There was a problem hiding this comment.
I gave both variants of the API to Copilot to compare (without initially disclosing what the function would do) and except for "if it runs in a loop for parsing" the additional out argument was the preferred variant.
What variants of the API did you give to Copilot in particular?
One interesting aspect was that modern compiler tend to optimize the additional out parameter variant better. So there's several aspects that lean towards that variant of the API.
Again I'm not sure what variant you are referring to.
Using the C standard library's string function API as a baseline for API quality sets a very low bar for quality … ;-)
I personally value utility of the API over the "quality" which I'm not sure what it is. Convenience of passing the arguments and calling the function? Register and stack usage?
It's not I disagree with changing the API, but I didn't see an API suggestion to compare with, thus I cannot judge anything regarding what Copilot said.
| typedef struct WCharEncoderState_ { | ||
| size_t pos; | ||
| void* buf; | ||
| mbstate_t mbState; | ||
| } WCharEncoderState; |
There was a problem hiding this comment.
ACK on making the error fatal.
Anyway, a good API makes it impossible to cause issues by abusing it. Thus if wcrtomb may write up to MB_LEN_MAX bytes you better should pass a buffer that can hold MB_LEN_MAX bytes, i.e. ensure the remaining space is large enough. Even if the actual call is known to write less.
And adding MB_LEN_MAX + 1 to the allocation won't cause any noticeable overhead (malloc will just allocate the same and in rare occasions or under memory pressure may delay due to the OOM killer sweeping some swap …
ffb876d to
7a186ea
Compare
📝 WalkthroughOverviewAdds UTF-8/Unicode-aware terminal-width handling for screen tab names so multibyte, zero-width and invalid sequences are measured and rendered correctly. Changes
Implementation qualityGood encapsulation: all Unicode/multibyte logic is contained in XUtils with minimal, targeted callsite changes. Two-pass encoding and explicit decoder state handling show care for invalid/incomplete sequences. Conditional compilation is used correctly for wide-char vs byte builds. The new code is nontrivial and requires careful review (high review effort). Commit structure & flowCommits are sensibly split: utility/types and helpers first, then callsite changes. This split aids review and bisecting. Cleanliness & scopeFocused, low-risk callsite changes; APIs are reusable and documented. Overall clean implementation. Known issues / review notes
WalkthroughThis PR adds multibyte/Unicode string utilities to XUtils (encoding, printable filtering, stateful decoding, line-break and mbswidth measurement). Action.c and ScreenManager.c were updated to use String_mbswidth when measuring tab labels so drawing and hit-area calculations use terminal display-column widths and render only the fitting byte span. Poem
Comment |
There was a problem hiding this comment.
Actionable comments posted: 2
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository UI
Review profile: ASSERTIVE
Plan: Pro Plus
Run ID: 7f12d758-cdea-480b-bbce-51bb9eced430
📒 Files selected for processing (4)
Action.cScreenManager.cXUtils.cXUtils.h
| case (size_t)-1: | ||
| // Invalid sequence | ||
| ps->ch = WEOF; | ||
| return false; | ||
|
|
There was a problem hiding this comment.
🧩 Analysis chain
🏁 Script executed:
#!/bin/bash
# Verify callers that may loop on String_lineBreakWidth/String_mbswidth
# and require forward progress when malformed bytes are present.
rg -n -C3 '\bString_lineBreakWidth\s*\(|\bString_mbswidth\s*\('Repository: htop-dev/htop
Length of output: 2394
🏁 Script executed:
# Get the size of XUtils.c first to determine reading strategy
wc -l XUtils.cRepository: htop-dev/htop
Length of output: 70
🏁 Script executed:
# Read the decoder function around lines 411-415
sed -n '385,420p' XUtils.cRepository: htop-dev/htop
Length of output: 969
🏁 Script executed:
# Read the String_lineBreakWidth function around lines 442-480
sed -n '442,480p' XUtils.cRepository: htop-dev/htop
Length of output: 1278
🏁 Script executed:
# Read the full String_decodeNextWChar to understand the complete context
sed -n '370,440p' XUtils.cRepository: htop-dev/htop
Length of output: 1725
🏁 Script executed:
sed -n '442,533p' XUtils.cRepository: htop-dev/htop
Length of output: 2586
Consume bytes on invalid multibyte sequences to ensure forward progress.
Invalid UTF-8 at line 411 returns false without advancing ps->str, causing String_lineBreakWidth to return without updating *str (line 495 is unreached). Callers retrying on the same pointer risk infinite loops. The incomplete sequence case (lines 418-421) correctly advances pointers; invalid sequences must do the same.
Proposed fix
switch (len) {
case (size_t)-1:
- // Invalid sequence
- ps->ch = WEOF;
- return false;
+ // Invalid sequence: consume one byte to guarantee progress
+ ps->str += 1;
+ ps->maxLen -= 1;
+ ps->ch = WEOF;
+ return false;
case (size_t)-2:And in String_lineBreakWidth, update *str on decode failure:
- if (!String_decodeNextWChar(&state))
+ if (!String_decodeNextWChar(&state)) {
+ *str = state.str;
break;
+ }| #endif | ||
|
|
||
| ATTR_NONNULL_N(1, 4) ATTR_ACCESS2_W(1) ATTR_ACCESS3_R(2, 3) | ||
| void EncodePrintableString(WCharEncoderState* ps, const char* src, size_t maxLen, EncodeWChar encodeWChar); |
There was a problem hiding this comment.
🛠️ Refactor suggestion | 🟠 Major | ⚡ Quick win
Rename EncodePrintableString to module-prefixed style.
This new public function name breaks the project’s C naming convention and makes the API inconsistent with surrounding String_* utilities.
As per coding guidelines, use ModuleName_functionName() naming convention for C functions.
There was a problem hiding this comment.
♻️ Duplicate comments (1)
XUtils.c (1)
410-420:⚠️ Potential issue | 🟠 Major | ⚡ Quick winGuarantee forward progress on malformed multibyte input.
String_decodeNextWChar()leaves the decoder on the same byte forEILSEQ, andString_lineBreakWidth()drops any decoder progress whenString_decodeNextWChar()returnsfalse. A caller that retriesString_mbswidth()on the returned*strcan stall forever on malformed or truncated input.Proposed fix
switch (len) { case (size_t)-1: - // Invalid sequence + // Invalid sequence: consume one byte and reset the shift state. + ps->str += 1; + ps->maxLen -= 1; + memset(&ps->mbState, 0, sizeof(ps->mbState)); ps->ch = WEOF; return false; case (size_t)-2: // Incomplete sequence ps->str += ps->maxLen; ps->maxLen = 0; return false; @@ - if (!String_decodeNextWChar(&state)) + if (!String_decodeNextWChar(&state)) { + *str = state.str; break; + }Also applies to: 475-476
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository UI
Review profile: ASSERTIVE
Plan: Pro Plus
Run ID: 06c427dd-1b32-423f-9c68-3f85a38430d2
📒 Files selected for processing (4)
Action.cScreenManager.cXUtils.cXUtils.h
|
@coderabbitai The suggestion of The safer approach is to restore the |
|
Tip For best results, initiate chat on the files or code changes.
The correct pattern is to snapshot mbstate_t savedState = ps->mbState;
size_t rc = mbrtowc(&wc, src, remaining, &ps->mbState);
if (rc == (size_t)-1 || rc == (size_t)-2) {
ps->mbState = savedState; /* restore shift state */
/* handle invalid/incomplete sequence */
}This preserves the shift state for the continuation of the string. I'll record this correction. ✏️ Learnings added
|
There was a problem hiding this comment.
Actionable comments posted: 1
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository UI
Review profile: ASSERTIVE
Plan: Pro Plus
Run ID: 4cdabc95-6f5b-4929-8efd-c4909a26881e
📒 Files selected for processing (4)
Action.cScreenManager.cXUtils.cXUtils.h
* String_makePrintable() * EncodePrintableString() * String_lineBreakWidth() * String_mbswidth() Signed-off-by: Kang-Che Sung <explorer09@gmail.com>
Compute the terminal widths of screen tab names using String_mbswidth(). No longer assume each character takes one terminal column. Signed-off-by: Kang-Che Sung <explorer09@gmail.com>
There was a problem hiding this comment.
♻️ Duplicate comments (1)
XUtils.c (1)
478-479:⚠️ Potential issue | 🟠 Major | ⚡ Quick winPreserve pointer progress when decoding fails.
At Line [478], breaking immediately on
MBStringDecoder_nextWChar()failure drops decoder progress (notably the incomplete-sequence path that already advanceddecoder.str), so callers can reprocess the same bytes.Proposed fix
- if (!MBStringDecoder_nextWChar(&decoder)) - break; + if (!MBStringDecoder_nextWChar(&decoder)) { + *str = decoder.str; + break; + }#!/bin/bash # Verify failure path behavior and whether caller updates *str on decode failure. rg -n -C3 '\bbool MBStringDecoder_nextWChar\b|\bString_lineBreakWidth\b|if \(!MBStringDecoder_nextWChar\(&decoder\)\)'
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository UI
Review profile: ASSERTIVE
Plan: Pro Plus
Run ID: d7294559-8c8a-42f8-99f7-04c4f0b544ad
📒 Files selected for processing (4)
Action.cScreenManager.cXUtils.cXUtils.h
The stopping at the first byte of the invalid sequence is intentional. How the invalid byte sequence would be handled is left to the caller. |
I would like to share a feature that I have done as part of a code refactoring process.
These changes are still work in progress, but I think it's good to file it and gather early feedback.
This feature makes the screen tab names display with the correct terminal widths when they contain Unicode characters (in UTF-8 encoding). It replaces the naive
strlencalculation with a new function that calculates the display terminal width of a name string.The PR introduces the following functions:
String_makePrintable, a sanitization function with an API similar toxStrndup. It reads the string, converts all unprintable characters to replacement code points (U+FFFD), and allocates and re-encodes the string.EncodePrintableString, an internal routine forString_makePrintable. This routine is designed with room to supportRichStringbuffersString_mbswidth, a function that calculates the terminal width of a string, or calculate the byte length of a substring if it is clipped by a terminal width limit.This PR is related to #1513.